TeYang Lau
Created: 4/11/2020
Last update: 12/11/2020

Exploratory Data Analysis
3.1. By Flat Type
3.2. By Town
3.3. By Storeys
3.4. By Floor Area
3.5. By Block Number
3.6. By Flat Model
3.7. By Lease Commence Date
3.8. By Distance to Nearest Amenities
3.9. By Number of Amenities in 2km Radius
Data Preparation
4.1. Missing Values
4.2. Multicollinearity
4.3. Normality
4.4. Label & Dummy Encoding
4.5. Feature Scaling
4.6. Outlier Detection
Linear Regression
5.1. Model Building & Results
5.2. Homoscedasticity and Normality of Residuals
5.3. Feature Importance
Random Forest
6.1. Out-Of-Bag
6.2. K-fold Cross Validation
6.3. Feature Importance
6.4. SHAP Values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
price1999 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-1990-1999.csv')
price2012 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
price2014 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
price2016 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
price2017 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')
cpi = pd.read_csv('Data/CPI.csv')
price1999.head()
price2012.head()
price2014.head()
price2016.head()
# Merge dfs
prices = pd.concat([price1999, price2012, price2014], sort=False)
prices = pd.concat([prices, price2016, price2017], axis=0, ignore_index=True, sort=False)
prices['month'] = pd.to_datetime(prices['month']) # to datetime
prices.info()
prices[~prices.isnull().any(axis=1)]['month'].dt.year.unique()
remaining_lease has many NAs: it is only available for sales from 2015 onwards.
# Clean flat type
prices['flat_type'] = prices['flat_type'].str.replace('MULTI-GENERATION', 'MULTI GENERATION')
prices['flat_type'].unique()
# Rename flat model duplicates
replace_values = {'NEW GENERATION':'New Generation', 'SIMPLIFIED':'Simplified', 'STANDARD':'Standard', 'MODEL A-MAISONETTE':'Maisonette', 'MULTI GENERATION':'Multi Generation', 'IMPROVED-MAISONETTE':'Executive Maisonette', 'Improved-Maisonette':'Executive Maisonette', 'Premium Maisonette':'Executive Maisonette', '2-ROOM':'2-room', 'MODEL A':'Model A', 'MAISONETTE':'Maisonette', 'Model A-Maisonette':'Maisonette', 'IMPROVED':'Improved', 'TERRACE':'Terrace', 'PREMIUM APARTMENT':'Premium Apartment', 'Premium Apartment Loft':'Premium Apartment', 'APARTMENT':'Apartment', 'Type S1':'Type S1S2', 'Type S2':'Type S1S2'}
prices = prices.replace({'flat_model': replace_values})
prices['flat_model'].value_counts()
Types of Flat Models:
Standard: (1/2/3/4/5-room). 1960s HDB flats, with WC and shower in the same room. 5-room Standard flats were introduced in 1974.
Improved: (1/2/3/4/5-room). Introduced in 1966; the 3/4-room versions had separate WC and shower, and they also featured void decks. 5-room Improved flats were introduced in 1974.
New Generation: First built in 1975, New Generation flats come as 3-Room (67 / 82 sqm) or 4-Room (92 sqm), featuring an en-suite toilet for the master bedroom with a pedestal-type water closet, plus a store room.
Model A: Introduced in 1981: 3-Room (75 sqm), 4-Room (105 sqm), 5-Room (135 sqm), 5-Room Maisonette (139 sqm)
Model A2: Smaller units of Model A, e.g. 4-Room Model A2 (90 sqm)
Simplified: Introduced in 1984: 3-Room (64 sqm), 4-Room (84 sqm)
Multi Generation: 3Gen flats designed to meet the needs of multi-generation families.
Maisonette: AKA Model A Maisonette, a 2-storey HDB flat
Premium Apartment: Introduced in the 1990s, featuring better-quality finishes; sold in ready-to-move condition with flooring, kitchen cabinets, and built-in wardrobes
Executive Maisonette: A more premium version of the Model A Maisonette. These units are no longer built, having been replaced by the Executive Condominium (EC) scheme in 1995
Executive Apartment: Executive Apartments / Maisonettes (146-150 sqm) were introduced in 1983 and replaced 5-Room Model A flats. In addition to the 3 bedrooms and separate living/dining areas found in 5A flats, EA and EM units feature a utility/maid room. 80% of Executive units were Maisonettes and 20% were Apartments.
DBSS: Public apartments built under HDB's short-lived Design, Build and Sell Scheme (DBSS) from 2005 to 2012. They are a unique (and premium) breed of HDB flats in Singapore, built by private developers, with high prices. Quite similar to Executive Condominiums, except a DBSS flat is like a premium HDB flat without the facilities of a private condo
Adjoined Flat: Large HDB flats combined from 2 HDB flats
Terrace: HDB terrace flats built in HDB's early days, before Singapore's land constraints were fully appreciated. Discontinued
Type S1S2: Apartments at The Pinnacle@Duxton, classified as "S" or Special apartments in view of their historical significance and award-winning design. For the application of HDB policies, S1 and S2 apartments are treated as 4-room and 5-room flats respectively
2-room: Most likely refers to 2-room Flexi flats, which have 1 bedroom and 1 common area
prices['storey_range'].unique()
prices['town'].unique()
plt.hist(prices['floor_area_sqm'], bins=50, edgecolor='black')
plt.title('Distribution of HDB Floor Area')
plt.show()
display(prices[prices['floor_area_sqm'] > 200]['flat_model'].value_counts())
The floor area outliers mostly belong to special HDB models that are larger than the standard ones, so they might not be outliers from a multivariate perspective.
bins = prices['lease_commence_date'].max() - prices['lease_commence_date'].min()
plt.hist(prices['lease_commence_date'], bins=bins, edgecolor='black')
plt.title('Distribution of Lease Commence Year')
plt.show()
# Compute Resale Price Adjusted for Inflation Using Consumer Price Index for Housing & Utilities
# https://www.singstat.gov.sg/find-data/search-by-theme/economy/prices-and-price-indices/latest-data
cpi['month'] = pd.to_datetime(cpi['month'], format='%Y %b') # to datetime
prices = prices.merge(cpi, on='month', how='left')
# https://people.duke.edu/~rnau/411infla.htm
prices['real_price'] = (prices['resale_price'] / prices['cpi']) * 100
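As a quick sanity check on the deflation step: dividing a nominal price by the CPI and multiplying by 100 re-expresses it in base-year (CPI = 100) dollars. The numbers below are illustrative only, not taken from the actual CPI.csv series.

```python
# Deflate a nominal price to base-year (CPI = 100) dollars.
# Illustrative numbers only; not from the actual CPI series.
def to_real_price(nominal, cpi):
    return nominal / cpi * 100

# A flat sold for $300,000 when the housing CPI stood at 80
# is worth $375,000 in base-year dollars.
print(to_real_price(300_000, 80))  # 375000.0
```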
# Plot Median Resale Prices Over the Years
# Unadjusted
fig = plt.figure(figsize=(14,4.5))
fig.suptitle('Median HDB Resale Prices Over the Years', fontsize=18)
ax1 = fig.add_subplot(121)
prices.groupby('month')[['resale_price']].median().plot(ax=ax1, color='#00cef6', legend=None)
ax1.set_xlabel('Date'), ax1.set_ylabel('Resale Price in SGD ($)'), ax1.set_ylim(0, 500000), ax1.set_title('Unadjusted for Inflation', size=15)
# Adjusted
# https://jakevdp.github.io/PythonDataScienceHandbook/04.09-text-and-annotation.html
ax2 = fig.add_subplot(122)
prices.groupby('month')[['real_price']].median().plot(ax=ax2, color='#3c78d8', legend=None)
ax2.set_xlabel('Date'), ax2.set_ylabel('Resale Price in SGD ($)'), ax2.set_ylim(0, 500000), ax2.set_title('Adjusted for Inflation to 2019 SGD',size=15)
ax2.annotate('1997 Asian Financial Crisis\nMedian: $403,766', xy=('1997-05-01',380000), xycoords='data',
bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="#28324a"), xytext=(50,-140), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
ax2.annotate('2013 Cooling Measures\nMedian: $401,887', xy=('2013-07-01',380000), xycoords='data',
bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="#28324a"), xytext=(0,-90), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
plt.tight_layout(rect=[0, 0, 0.9, 0.9])
# for ax, color in zip([ax1, ax2], ['#3c78d8', '#3c78d8']):
# plt.setp(tuple(ax.spines.values()), color=color)
# plt.setp([ax.get_xticklines(), ax.get_yticklines()], color=color)
plt.show()
#prices.set_index('month').loc['1997']['real_price'].median()
Following the collapse of the Thai baht in July 1997, housing prices in Singapore continued to fall and only started gradually increasing again around 2004. In 2013, prices declined due to property market cooling measures, such as the Additional Buyer's Stamp Duty (ABSD), Loan-to-Value (LTV) ratio limits, and the Total Debt Servicing Ratio (TDSR). Refer here for more information.
# Convert remaining_lease to number of years
def getYears(text):
    if isinstance(text, str):
        yearmonth = [int(s) for s in text.split() if s.isdigit()]
        if len(yearmonth) > 1:  # both years and months, e.g. "61 years 04 months"
            years = yearmonth[0] + (yearmonth[1] / 12)
        else:  # years only
            years = yearmonth[0]
        return years
    else:  # already numeric (or NaN)
        return text

prices['remaining_lease'] = prices['remaining_lease'].apply(getYears)
prices.info()
bins = prices['remaining_lease'].max() - prices['remaining_lease'].min()
plt.hist(prices['remaining_lease'], bins=int(bins), edgecolor='black')
plt.title('Distribution of Remaining Lease for 2016-2020 Data')
plt.show()
prices.head()
## Waffle chart for flat type - number of rooms
from pywaffle import Waffle
flattype = dict(prices['flat_type'].value_counts()/len(prices)*100)
flattype1519 = dict(prices.set_index('month')['2015':'2019'].reset_index()['flat_type'].value_counts()/len(prices.set_index('month')['2015':'2019'].reset_index())*100)
plt.figure(figsize=(10, 5),
           FigureClass=Waffle,
           plots={
               '211': {
                   'values': flattype,
                   'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 11},
                   'title': {'label': 'Proportion of HDB Flat Types (All Years)', 'loc': 'left', 'fontsize': 16}
               },
               '212': {
                   'values': flattype1519,
                   'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 11},
                   'title': {'label': 'Proportion of HDB Flat Types (2015-2019)', 'loc': 'left', 'fontsize': 16}
               },
           },
           rows=5,
           colors=["#1f77b4", "#ff7f0e", "green", 'purple', 'black', 'yellow', 'brown'],
           #colors=["#3c78d8", "#00cef6", "#aff000", '#28324a', 'black', 'yellow', 'brown'],
           icons='home',
           font_size=22,
           icon_legend=True)
plt.show()
There are not many 1 Room, 2 Room and Multi Generation flats, so they are removed from the flat-type analysis below.
flattype = ['3 ROOM','4 ROOM','5 ROOM','EXECUTIVE']
prices1519 = prices.set_index('month').sort_index().loc['2015-01':'2019-12']
prices1519 = prices1519[prices1519['flat_type'].isin(flattype)][['flat_type','real_price']].reset_index()
prices1519['flat_type_year'] = prices1519['flat_type'] + ' - ' + prices1519['month'].apply(lambda x: str(x)[:4])
prices1519
# ridgeline plot for looking at distribution of flat types by year
import joypy
fig, axes = joypy.joyplot(prices1519, by="flat_type_year", column="real_price",figsize=(9,7),
linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',4))
axes[-1].set_xlim([0,1200000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k'])
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()
We don't see much change within each flat type over the last 5 years. The only consistent pattern is that prices are higher for flats with more rooms, which is unsurprising.
## 2015 to 2019
prices['year'] = pd.DatetimeIndex(prices['month']).year # extract out year
prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby(['town'], as_index=False).agg({'real_price': 'median'}).sort_values('real_price', ascending=True).reset_index(drop=True)
prices1519['real_price'] = round(prices1519['real_price']/1000)
prices1519['color'] = ['#f8766d'] + ['#3c78d8']*(len(prices1519)-2) + ['#00ba38']
# 4-room
prices1519_4room = prices[(prices['flat_type'].isin(['4 ROOM'])) & (prices['year'].isin([2015,2016,2017,2018,2019]))].groupby(['town'], as_index=False).agg({'real_price': 'median'}).sort_values('real_price', ascending=True).reset_index(drop=True)
prices1519_4room['real_price'] = round(prices1519_4room['real_price']/1000)
prices1519_4room['color'] = ['#f8766d','#f8766d'] + ['#3c78d8']*(len(prices1519_4room)-3) + ['#00ba38']
## 1997 vs 2019
# all room type
prices9719 = prices[prices['year'].isin([1997,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719['change'] = prices9719.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719 = prices9719[prices9719['change'].notnull()]
prices9719 = prices9719.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719['color'] = prices9719['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
# 4-room
prices9719_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([1997,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719_4room['change'] = prices9719_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719_4room = prices9719_4room[prices9719_4room.change.notnull()]
prices9719_4room = prices9719_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719_4room['color'] = prices9719_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
## 2018 vs 2019
# all room type
prices1819 = prices[prices['year'].isin([2018,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819['change'] = prices1819.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819 = prices1819[prices1819['change'].notnull()]
prices1819 = prices1819.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819['color'] = prices1819['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
# 4-room
prices1819_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([2018,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819_4room['change'] = prices1819_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819_4room = prices1819_4room[prices1819_4room.change.notnull()]
prices1819_4room = prices1819_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819_4room['color'] = prices1819_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
# Function for lollipop charts
def loll_plot(df, x, y, subtitle, xlabel, xlim):
    plt.rc('axes', axisbelow=True)
    plt.grid(linestyle='--', alpha=0.4)
    plt.hlines(y=df.index, xmin=0, xmax=df[x], color=df.color, linewidth=1)
    plt.scatter(df[x], df.index, color=df.color, s=300)
    for i, txt in enumerate(df[x]):
        plt.annotate(str(round(txt)), (txt, i), color='white', fontsize=9, ha='center', va='center')
    plt.annotate(subtitle, xy=(1, 0), xycoords='axes fraction', fontsize=20,
                 xytext=(-5, 5), textcoords='offset points',
                 ha='right', va='bottom')
    plt.yticks(df.index, df[y]); plt.xticks(fontsize=12); plt.xlim(xlim)
    plt.xlabel(xlabel, fontsize=14)
fig = plt.figure(figsize=(12,7))
ax1 = plt.subplot(121)
loll_plot(prices1519, 'real_price', 'town', 'All Room Types', 'Resale Price (SGD)', [50,800])
ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()])
ax1.yaxis.set_ticks_position('none')
ax2 = plt.subplot(122)
loll_plot(prices1519_4room, 'real_price', 'town', '4-Room', 'Resale Price (SGD)', [50,800])
ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()])
ax2.yaxis.set_ticks_position('none')
fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2015 to 2019, Median Price of Flats', fontsize=16)
plt.show()
fig = plt.figure(figsize=(12,7))
ax1 = plt.subplot(121)
loll_plot(prices9719, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-40,125])
ax2 = plt.subplot(122)
loll_plot(prices9719_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-40,125])
fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('1997 vs 2019, Median Price of Flats', fontsize=16)
plt.show()
Queenstown shows one of the largest increases in resale price, possibly because of how much it has been developed over the past 2 decades.
fig = plt.figure(figsize=(12,7))
ax1 = plt.subplot(121)
loll_plot(prices1819, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-30,20])
ax2 = plt.subplot(122)
loll_plot(prices1819_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-30,20])
fig.tight_layout(pad=0.5, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2018 vs 2019, Median Price of Flats', fontsize=16)
plt.show()
The changes from 2018 to 2019 are not very large, although prices in Toa Payoh have dropped, and Central Area 4-room flats have dropped by 20%. Could it be because these areas have older HDB flats, meaning their leases are now shorter? As shown below, places like Punggol and Sengkang tend to have later lease commence dates, as they were developed later, which might explain their slight increase in prices, while places like Toa Payoh and the Central Area tend to have earlier lease commence dates.
prices[prices['year'].isin([2018,2019])].groupby('town')['lease_commence_date'].median().sort_values()
fig = plt.figure(figsize=(12,4))
# Storey Prices
ax1 = plt.subplot(121)
storey = prices.groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey['storey_rank'] = storey['storey_range'].astype('category').cat.codes # label encode
a=sns.scatterplot(x=storey['storey_rank'], y=storey['real_price'], s=storey['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax1)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_xticklabels(pd.Series(['']).append(storey.iloc[[0,5,10,15,20,24],0]))
ax1.set_ylim([280000,1100000]), ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Storey', size=15)
ax1.set_title('All Years', size=15)
# Floor Area Prices
ax2 = plt.subplot(122)
storey2 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey2['storey_rank'] = storey2['storey_range'].astype('category').cat.codes
# Bubble chart
b=sns.scatterplot(x=storey2['storey_rank'], y=storey2['real_price'], s=storey2['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax2)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000]
ax2.set_yticklabels(ylabels); ax2.set_ylabel('')
ax2.set_xticks([0,4,8,12,16])
ax2.set_xticklabels(storey2.iloc[[0,4,8,12,16],0])
ax2.set_ylim([280000,1100000]), ax2.set_xlabel('Storey', size=15)
ax2.set_title('2015 to 2019', size=15)
plt.show()
Surprisingly, this follows a roughly linear relationship, with higher storeys selling at higher prices.
# Floor Area Prices
area = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
p=sns.regplot(x='floor_area_sqm', y='real_price', data=area, scatter_kws={"s": 1, 'alpha':0.5})
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
p.set_xlabel('Floor Area (Square Meters)', size=15)
plt.show()

display(area[area['floor_area_sqm'] > 200])
The cases at the top right of the chart consist of flats that are either Terrace or Executive Maisonette models, which is not surprising.
The 3-digit block numbering system was introduced in the 1970s, with the first digit representing a neighbourhood within a town. For example, AMK neighbourhood 1 blocks start with 101, and AMK neighbourhood 2 blocks start with 201. The first digit is therefore separated from the last 2 digits and plotted separately.
import re
# Block Number Prices
get_num = lambda x: int(re.findall(r"\d+", x)[0])
prices['blocknum'] = prices['block'].apply(get_num) # get only digits from block number
tmp = prices[prices['blocknum'] > 99] # get only blocks that use 3-digit numbering system
tmp = tmp.groupby('blocknum')['real_price'].median().reset_index()
# Scatterplots
fig = plt.figure(figsize=(12,4))
ax1 = plt.subplot(121)
a=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[0])), y=tmp['real_price'], color='#ff9933', edgecolors='w', alpha=0.9)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Neighbourhood Number', size=15)
ax2 = plt.subplot(122)
b=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[1:])), y=tmp['real_price'], edgecolors='w', alpha=0.9)
ax2.set_yticklabels(ylabels)
ax2.set_ylabel('', size=15)
ax2.set_xlabel('Block Number', size=15)
plt.show()
Block number doesn't seem to influence prices.
# Violin plots for price distribution of each flat model
fig = plt.figure(figsize=(12,7))
p=sns.violinplot(x='flat_model', y='real_price', data=prices, width=1,
order=prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist())
p.set_xticklabels(p.get_xticklabels(), rotation=30, ha='right'), p.set_xlabel('Flat Models', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
plt.show()
The special models like Type S1S2 (The Pinnacle@Duxton) and Terrace tend to fetch higher prices, while the older models tend to go for less.
# ridgeline plot
# tmp = prices.set_index('flat_model')
# tmp = tmp.loc[prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist()].reset_index().groupby("flat_model", sort=False)
# fig, axes = joypy.joyplot(tmp, by="flat_model", column="real_price",figsize=(12,5),
# linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
# axes[-1].set_xlim([-50000,1400000])
# axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k'])
# plt.xlabel('Resale Price SGD ($)', fontsize=14)
# fig.show()
# Boxplot for each year of lease commence date
fig = plt.figure(figsize=(7,9))
p=sns.boxplot(y='lease_commence_date', x='real_price', data=prices, width=1, orient='h', flierprops = dict(markerfacecolor = 'red', markersize = 0.1, linestyle='none'), linewidth=0.4)
p.set_xlabel('Resale Price SGD ($)', size=15), p.set_ylabel('Lease Commence Year', size=15)
xlabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_xticks()/1000]
p.set_xticklabels(xlabels)
p.set_title('Resale Price By Lease Commence Year', size=15)
plt.show()
tmp = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
fig, axes = joypy.joyplot(tmp, by="lease_commence_date", column="real_price",figsize=(6,10),
linewidth=1,overlap=5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
axes[-1].set_xlim([-50000,1400000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k'])
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()
The names of schools, supermarkets, hawker centres, shopping malls, parks and MRT stations were downloaded/scraped from Data.gov.sg and Wikipedia and fed through a function that uses the OneMap.sg API to get their coordinates (latitude and longitude). These coordinates were then fed through other functions that use the geopy package to get the distance between locations. This gives the distance from each flat to the nearest amenity of each type, as well as the number of each amenity within a 2km radius of each flat.
The script for this can be found here.
Some Kallang/Whampoa flats have no distance values due to a search error on OneMap.sg.
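The distance step can be sketched as follows. This is a toy stand-in: the actual script uses geopy's geodesic distance, while here a plain haversine formula is used, and the coordinates below are illustrative, not real amenity locations.

```python
import math

# Great-circle distance in km between two (lat, lon) points via the
# haversine formula; a stand-in for geopy's geodesic distance, which
# the actual script uses and which is slightly more accurate.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Distance from one flat to each amenity; the nearest is the minimum,
# and counting distances under 2 gives the "within 2km radius" feature.
flat = (1.3691, 103.8454)                         # illustrative coordinates
mrts = [(1.3700, 103.8496), (1.3496, 103.8492)]   # illustrative stations
dists = [haversine_km(*flat, *stn) for stn in mrts]
print(round(min(dists), 2), sum(d < 2 for d in dists))
```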
flat_amenities = pd.read_csv('Data/flat_amenities.csv')
# merge amenities data to flat data
prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].copy()
prices1519['flat'] = prices1519['block'] + ' ' + prices1519['street_name']
prices1519 = prices1519.merge(flat_amenities, on='flat', how='left')
# reduce number of class of town to regions
d_region = {'ANG MO KIO':'North East', 'BEDOK':'East', 'BISHAN':'Central', 'BUKIT BATOK':'West', 'BUKIT MERAH':'Central',
'BUKIT PANJANG':'West', 'BUKIT TIMAH':'Central', 'CENTRAL AREA':'Central', 'CHOA CHU KANG':'West',
'CLEMENTI':'West', 'GEYLANG':'Central', 'HOUGANG':'North East', 'JURONG EAST':'West', 'JURONG WEST':'West',
'KALLANG/WHAMPOA':'Central', 'MARINE PARADE':'Central', 'PASIR RIS':'East', 'PUNGGOL':'North East',
'QUEENSTOWN':'Central', 'SEMBAWANG':'North', 'SENGKANG':'North East', 'SERANGOON':'North East', 'TAMPINES':'East',
'TOA PAYOH':'Central', 'WOODLANDS':'North', 'YISHUN':'North'}
prices1519['region'] = prices1519['town'].map(d_region)
colors = {'North East':'Purple', 'East':'Green', 'Central':'Brown', 'West':'Red', 'North':'Orange'}
# get median info of each town
tmp = prices1519.groupby('town')[['dist_dhoby','school_dist','num_school_2km','hawker_dist','num_hawker_2km','park_dist','num_park_2km','mall_dist','num_mall_2km','mrt_dist','num_mrt_2km','supermarket_dist','num_supermarket_2km','real_price']].median().reset_index()
tmp['region'] = tmp['town'].map(d_region)
# Scatterplot with names of towns
fig, ax = plt.subplots(figsize=(8,5))
grouped = tmp.groupby('region')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='dist_dhoby', y='real_price', label=key, color=colors[key], s=60)
b, a = np.polyfit(tmp['dist_dhoby'], tmp['real_price'], 1)
ax.plot(tmp['dist_dhoby'], a + b* tmp['dist_dhoby'], '-')
ax.set_xlim([0,20]), ax.set_xlabel('Distance from Dhoby Ghaut MRT (km)', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000]
ax.set_yticklabels(ylabels), ax.set_ylabel('Resale Price SGD ($)', size=15)
for i, txt in enumerate(tmp['town']):
    ax.annotate(txt, (tmp['dist_dhoby'][i]+0.3, tmp['real_price'][i]), size=8, alpha=0.9)
plt.show()
prices1519.groupby('region')['real_price'].median()
The relationship is negative: flats further away from Dhoby Ghaut MRT (in the Central region) have lower resale prices.
# scatterplot for median price of each town against nearest distance from each amenity
p=sns.pairplot(tmp, x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Distance from Nearest Amenities (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
Marine Parade is far from an MRT station.
# scatterplot for price of each flat against nearest distance from each amenity
p=sns.pairplot(prices1519[prices1519['school_dist']<3], x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e'))) # remove outliers (>3km)
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price VS Distance from Nearest Amenities')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Surprisingly, the relationship with distance from school is not as strong as expected, while the relationship with distance from hawker centres is more pronounced than in the median plots.
# scatterplot for median price of each town against number of amenities
p=sns.pairplot(tmp, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Number of Amenities in 2km Radius (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
# scatterplot for price of each flat against number of amenities
p=sns.pairplot(prices1519, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e')))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price VS Number of Amenities in 2km Radius')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Punggol has its own LRT, and the Central Area has many stations and malls nearby.
# clear unused variables
del(price1999, price2012, price2014, price2016, price2017, prices1819, prices1819_4room, prices9719, prices9719_4room,
    storey, storey2, tmp, xlabels, ylabels, p, grouped, flattype1519, flattype, flat_amenities, cpi, ax, ax1, ax2, area)
import gc
gc.collect()
Replace missing distance values with the median of the town. Only Kallang/Whampoa has missing data, so the function below replaces its NAs with the medians of the Kallang/Whampoa distance variables.
df = prices1519[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date', 'year', 'school_dist', 'num_school_2km', 'hawker_dist', 'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist', 'num_supermarket_2km', 'dist_dhoby', 'region', 'real_price']]
# function for replacing NAs with median of the town
def replace_NA_median(df, columns):
    for c in columns:
        df[c] = df.groupby('town')[c].transform(lambda x: x.fillna(x.median()))
    return df

df = replace_NA_median(df, ['school_dist', 'num_school_2km', 'hawker_dist',
                            'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist',
                            'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist',
                            'num_supermarket_2km', 'dist_dhoby'])
df.info()
Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. Multicollinearity can be a problem because it prevents us from distinguishing the individual effects of the independent variables on the dependent variable.
If our goal is classification or prediction, and not to examine the importance or contribution of each feature, we do not need to deal with multicollinearity, as it does not affect the overall prediction or goodness of fit; it affects only the p-values and coefficients. But since we are interested in feature importance via the coefficients of the linear regression model, we have to address it.
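Concretely, the variance inflation factor for feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. A minimal numpy sketch on toy data (the notebook itself uses statsmodels' variance_inflation_factor, which computes the same quantity):

```python
import numpy as np

# VIF for column j of X: regress it on the other columns, then
# VIF = 1 / (1 - R^2). Toy sketch of what statsmodels'
# variance_inflation_factor does.
def vif(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent
X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))  # x1's VIF is large, x3's is near 1
```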
# Correlation heatmap
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(df.select_dtypes(include=['int64','float64']).corr(), annot = True, fmt='.2g',
vmin=-1, vmax=1, center= 0, cmap= 'coolwarm_r', linecolor='black', linewidth=1, annot_kws={"size": 7})
#ax.set_ylim(0 ,5)
plt.xticks(rotation=45, ha='right')
plt.title('Correlation Heatmap')
fig.show()
# Multicollinearity
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
    # Calculating VIF for each feature
    # (note: statsmodels computes VIF on the design matrix as given;
    # without an added constant column the values can be inflated)
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['tolerance'] = 1 / vif.VIF
    vif['meanVIF'] = vif.VIF.mean()
    return vif
calc_vif(df.drop('real_price',axis=1).select_dtypes(include=['int64','float64']))
calc_vif(df.drop(['real_price','num_supermarket_2km','year','num_school_2km','dist_dhoby'],axis=1).select_dtypes(include=['int64','float64']))
# drop columns
lr_df = df.drop(['num_supermarket_2km','year','num_school_2km','dist_dhoby'], axis=1)
There is definitely multicollinearity in our continuous features. Even after removing several features, the highest VIF is still around 50 (lease_commence_date), and floor_area_sqm is above 10 as well. I chose to leave them in as I believe they are important contributors to resale prices. One alternative is to refit the model without them and compare the outputs.
# Plot distribution for each continuous variable
lr_df.hist(bins=50, figsize=(15, 10), grid=False, edgecolor='black')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
Not all the variables follow a normal distribution, and most of the distance variables have some outliers. For these outliers, it's better to use a multivariate approach like Mahalanobis distance to identify them instead of a univariate approach. If needed, some of these variables can be transformed to reduce their skewness.
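As a quick illustration of how a log transform tames right skew, a sketch on a lognormal toy sample (standing in for a price-like variable, not the actual data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=12, sigma=0.5, size=10_000)  # right-skewed, price-like

# logging a lognormal sample recovers a symmetric, near-normal distribution
print(f'skew before log: {skew(x):.2f}')
print(f'skew after log:  {skew(np.log(x)):.2f}')
```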
# plot qqplot before and after log transformation
from statsmodels.api import qqplot
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2,figsize=(5,5))
ax1.hist(lr_df['real_price'], bins=50, edgecolor='black')
qqplot(lr_df['real_price'], line='s', ax=ax2)
ax3.hist(np.log(lr_df['real_price']), bins=50, edgecolor='black')
qqplot(np.log(lr_df['real_price']), line='s', ax=ax4)
plt.suptitle('Real Price Normality Before (Top) & After (Bottom) Logging')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

We will fit the linear regression first and come back to check on the normality of the residuals as well as homoscedasticity.
# Frequency plots for categorical features
fig = plt.figure(figsize=(30,5))
for count, col in enumerate(df.select_dtypes(include=['category','object']).columns):
    fig.add_subplot(1,5,count+1)
    df[col].value_counts().plot.barh()
    plt.title(col)
    plt.yticks(rotation=45)
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
Region might be a better choice than town, since town has many classes and one-hot encoding it might lead to a very sparse matrix. For flat_type, we can remove Multi Generation and 1 Room since there are not many instances of them. storey_range will be label encoded according to its levels, while flat_model should be further grouped to reduce the number of classes.
# label encode storeys; cat.codes assigns codes by sorted category order,
# which matches the actual storey order here since the ranges are zero-padded
df['storey_range'] = df['storey_range'].astype('category').cat.codes
lr_df['storey_range'] = lr_df['storey_range'].astype('category').cat.codes
# remove flat types with very few cases
df = df[~df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
lr_df = lr_df[~lr_df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
# Re-categorize flat model to reduce num classes
replace_values = {'Executive Maisonette':'Maisonette', 'Terrace':'Special', 'Adjoined flat':'Special',
'Type S1S2':'Special', 'DBSS':'Special', 'Model A2':'Model A', 'Premium Apartment':'Apartment', 'Improved':'Standard', 'Simplified':'Model A', '2-room':'Standard'}
df = df.replace({'flat_model': replace_values})
lr_df = lr_df.replace({'flat_model': replace_values})
# Label encode flat type
replace_values = {'2 ROOM':0, '3 ROOM':1, '4 ROOM':2, '5 ROOM':3, 'EXECUTIVE':4}
df = df.replace({'flat_type': replace_values})
lr_df = lr_df.replace({'flat_type': replace_values})
df = df.reset_index(drop=True)
display(df['flat_model'].value_counts())
lr_df = lr_df.reset_index(drop=True)
display(lr_df['flat_model'].value_counts())
Storey_range and flat_type were label encoded since they are ordinal.
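Note that cat.codes assigns codes by the sorted category order, not by row order; because the storey ranges are zero-padded, this coincides with the true storey order. A minimal check on made-up storey labels:

```python
import pandas as pd

# out-of-order storey labels; cat.codes still maps them by sorted category order
s = pd.Series(['04 TO 06', '01 TO 03', '10 TO 12', '07 TO 09'])
print(s.astype('category').cat.codes.tolist())  # [1, 0, 3, 2]
```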
display(lr_df.head())
## dummy encoding
df = pd.get_dummies(df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
df = pd.get_dummies(df, columns=['flat_model'], prefix=['model'])
df= df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
lr_df = pd.get_dummies(lr_df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
lr_df = pd.get_dummies(lr_df, columns=['flat_model'], prefix=['model'])
lr_df= lr_df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
Region and flat_model were dummy encoded, with Central region and Standard model selected as the baseline to which other classes are compared to.
Scaling is only done for linear regression. Tree-based models do not require scaling, as monotone rescaling of a feature does not change their splits or performance.
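To see why tree-based models are unaffected by scaling, a quick sketch on toy data: fitting the same tree on raw and rescaled features yields identical predictions, because splits are just threshold comparisons.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# same tree, one fit on raw features and one on features scaled by 1000
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X * 1000, y)
print(np.allclose(tree_raw.predict(X), tree_scaled.predict(X * 1000)))  # True
```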
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit to continuous columns and transform
scaled_columns = ['floor_area_sqm','lease_commence_date','school_dist','hawker_dist','num_hawker_2km','park_dist',
'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist']
scaler.fit(lr_df[scaled_columns])
scaled_columns = pd.DataFrame(scaler.transform(lr_df[scaled_columns]), index=lr_df.index, columns=scaled_columns)
# separate unscaled features
unscaled_columns = lr_df.drop(scaled_columns, axis=1)
# concatenate scaled and unscaled features
lr_df = pd.concat([scaled_columns,unscaled_columns], axis=1)
display(lr_df.head())
Cook's Distance was used for outlier detection, as Mahalanobis Distance required too much memory for this large dataset. The threshold used here for Cook's Distance is 4/n.
from yellowbrick.regressor import CooksDistance
lr_y = lr_df[['real_price']]
lr_X = lr_df.drop(['real_price','town'], axis=1)
yy = np.log(lr_y)['real_price']
XX = lr_X.values
visualizer = CooksDistance()
visualizer.fit(XX, yy)
#visualizer.show()
plt.close()

from sklearn.linear_model import LinearRegression
from yellowbrick.regressor import ResidualsPlot
# visualize residuals before outlier removal
model = LinearRegression()
visualizer_residuals = ResidualsPlot(model)
visualizer_residuals.fit(XX, yy)
#visualizer_residuals.show()
plt.close()

# remove outliers
i_less_influential = (visualizer.distance_ <= visualizer.influence_threshold_)
X_li, y_li = XX[i_less_influential], yy[i_less_influential]
lr_X, lr_y = lr_X[i_less_influential], lr_y[i_less_influential]
# visualize residuals after outliers removal
model = LinearRegression()
visualizer_residuals = ResidualsPlot(model)
visualizer_residuals.fit(X_li, y_li)
#visualizer_residuals.show()
plt.close()

After removing outliers (~5000, or 5.24%), the homoscedasticity improved. The residuals are normally distributed around 0, satisfying the linearity and normality assumptions of the linear model.
from sklearn.linear_model import LinearRegression
# sklearn method, which doesn't give much additional info
lin_reg = LinearRegression()
lin_reg.fit(lr_X, np.log(lr_y))
print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(lr_X, np.log(lr_y))}')
# statsmodel method, which gives more info
import statsmodels.api as sm
# alternate way using statistical formula, which does not require dummy coding manually
# https://stackoverflow.com/questions/50733014/linear-regression-with-dummy-categorical-variables
# https://stackoverflow.com/questions/34007308/linear-regression-analysis-with-string-categorical-features-variables
X_constant = sm.add_constant(lr_X)
lin_reg = sm.OLS(np.log(lr_y),X_constant).fit()
lin_reg.summary()
Since the sample is huge, most features will be statistically significant even though they might not be practically significant.
# scatterplot of y (observed) and yhat (predicted)
plt.style.use('default')
fig = plt.figure(figsize=(5,3))
ax = sns.scatterplot(x=np.log(lr_y)['real_price'], y=lin_reg.predict(), edgecolors='w', alpha=0.9, s=8)
ax.set_xlabel('Observed')#, ax.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_xticks()/1000])
ax.set_ylabel('Predicted')#, ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000])
ax.annotate('Adjusted R\u00b2: ' + str(format(round(lin_reg.rsquared_adj,2),'.2f')), xy=(0, 1), xytext=(25, -25),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
plt.close()

The selected features are able to account for 90% of the variance in HDB resale prices.
# Homoscedasticity and Normality of Residuals
pred = lin_reg.predict()
resids = lin_reg.resid
resids_studentized = lin_reg.get_influence().resid_studentized_internal
fig = plt.figure(figsize=(10,3))
ax1 = plt.subplot(121)
sns.scatterplot(x=pred, y=resids_studentized, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Studentized Residuals')
ax2 = plt.subplot(122)
sns.distplot(resids_studentized, norm_hist=True, hist_kws=dict(edgecolor='w'))
ax2.set_xlabel('Studentized Residual')
plt.close()

Homoscedasticity appears to be satisfied. The residuals are normally distributed around 0, satisfying the linearity and normality assumptions of the linear model.
# Feature Importance
lr_results = pd.read_html(lin_reg.summary().tables[1].as_html(),header=0,index_col=0)[0]
coefs = lr_results[['coef']][1:].reset_index().rename(columns={'index':'feature'})
coefs['feature_importance'] = np.abs(coefs['coef'])
coefs = coefs.sort_values('feature_importance').reset_index(drop=True)
coefs['color'] = coefs['coef'].apply(lambda x: '#66ff8c' if x>0 else '#ff8c66')
coefs.plot.barh(x='feature',y='feature_importance',color=coefs['color'],figsize=(4,5))
colors = {'Positive':'#66ff8c', 'Negative':'#ff8c66'}
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
legend = plt.legend(handles, labels, title='Relationship', fontsize = '15')
plt.setp(legend.get_title(),fontsize='17')
plt.xlabel('Standardized Coefficients', size=15), plt.ylabel('Features', size=15)
plt.ylim([-1,23])
plt.title('Linear Regression Feature Importance', size=15)
plt.show()
It seems that region is the feature that drives resale prices the most. The Central region was used as the baseline against which all other regions are compared to. This means that all the other regions tend to be sold lower as compared to the Central area. Floor area, lease commence date and special flat models are also positive drivers of resale prices while distance to nearest hawker center is a negative driver.
Let's look at whether a non-linear model will show consistent drivers for resale prices.
from sklearn.model_selection import train_test_split
# Train Test Split
y = df[['real_price']]
X = df.drop(['real_price','town', 'year'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1, shuffle=True, random_state=0)
print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)
Here, I used a ratio of 9:1 for the train and test set and used both Out-Of-Bag and K-fold Cross Validation as validation methods.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from scipy.stats import spearmanr, pearsonr
# Validation using out-of-bag method
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
predicted_train = rf.predict(X_train)
print(f'Out-of-bag R\u00b2 score estimate: {rf.oob_score_:>5.3}')
# predict and get evaluation metrics on test set
predicted_test = rf.predict(X_test)
oob_test_score = r2_score(y_test['real_price'], predicted_test)
spearman = spearmanr(y_test['real_price'], predicted_test)
pearson = pearsonr(y_test['real_price'], predicted_test)
oob_mae = mean_absolute_error(y_test['real_price'], predicted_test)
print(f'Test data R\u00b2 score: {oob_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(oob_mae)}')
from sklearn.model_selection import GridSearchCV
# validation by k-fold cross validation with grid search for best hyperparameters
# hyperparameter values shown below are the tuned final values
param_grid = {
'max_features': ['auto'], # max number of features considered for splitting a node
'max_depth': [20], # max number of levels in each decision tree
'min_samples_split': [15], # min number of data points placed in a node before the node is split
'min_samples_leaf': [2]} # min number of data points allowed in a leaf node
rfr =GridSearchCV(RandomForestRegressor(n_estimators = 500, n_jobs=-1, random_state=0),
param_grid, cv=10, scoring='r2', return_train_score=True)
rfr.fit(X_train,y_train)
print("Best parameters set found on Cross Validation:\n\n", rfr.best_params_)
print("\nCross Validation R\u00b2 score:\n\n", rfr.best_score_.round(3))
# predict and get evaluation metrics for test set
cv_predicted_test = rfr.predict(X_test)
cv_test_score = r2_score(y_test['real_price'], cv_predicted_test)
spearman = spearmanr(y_test['real_price'], cv_predicted_test)
pearson = pearsonr(y_test['real_price'], cv_predicted_test)
cv_mae = mean_absolute_error(y_test['real_price'], cv_predicted_test)
print(f'Test data R\u00b2 score: {cv_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(cv_mae)}')
# scatterplots of y (observed) and yhat (predicted)
fig = plt.figure(figsize=(13,4))
ax1 = plt.subplot(121)
ax1 = sns.scatterplot(x=y_test['real_price'], y=predicted_test, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Observed'), ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()/1000])
ax1.set_ylabel('Predicted'), ax1.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_yticks()/1000])
ax1.annotate('Test R\u00b2: ' + str(round(oob_test_score,3)) + '\nTest MAE: ' + str(round(oob_mae)), xy=(0, 1), xytext=(25, -35),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax1.set_title('Tuned Using Out-Of-Bag')
ax2 = plt.subplot(122)
ax2 = sns.scatterplot(x=y_test['real_price'], y=cv_predicted_test, edgecolors='w', alpha=0.9, s=8)
ax2.set_xlabel('Observed'), ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()/1000])
ax2.set_ylabel('Predicted'), ax2.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000])
ax2.annotate('Test R\u00b2: ' + str(round(cv_test_score,3)) + '\nTest MAE: ' + str(round(cv_mae)), xy=(0, 1), xytext=(25, -35),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax2.set_title('Tuned Using Cross Validation')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Random Forest appears to perform better than linear regression in terms of R².
fig = plt.figure(figsize=(14,7))
ax1 = plt.subplot(121)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rf.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
#plt.xticks(rotation=45, ha='right')
ax1.set_title('OOB Feature Importance', size=15)
ax2 = plt.subplot(122)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rfr.best_estimator_.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
ax2.set_title('CV Feature Importance', size=15)
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
fig.show()
The feature importances are a bit different from the linear regression model. Floor area and lease commence date are still among the main drivers of resale prices. However, features like distance from Dhoby Ghaut MRT and flat type also appear to be strong drivers.
Tree-based models seem to give lower importance to categorical features. This is because the importance score measures how often a feature is selected for splitting and how much gain in purity it achieves when selected, which favours continuous and high-cardinality features.
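One hedge against this bias is permutation importance, which instead measures how much the score drops when a feature's values are shuffled. A sketch on toy data (not the fitted model above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=1000),        # informative, continuous
                     rng.integers(0, 2, 1000)])    # informative, binary
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=1000)

rf_toy = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
imp = permutation_importance(rf_toy, X, y, n_repeats=5, random_state=0)
print('impurity importances:   ', rf_toy.feature_importances_)
print('permutation importances:', imp.importances_mean)
```

Both features matter here, and permutation importance gives the low-cardinality binary feature a fairer share than the impurity scores do.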
We can also look at feature importance using SHAP (SHapley Additive exPlanations) values, first proposed by Lundberg and Lee (2017), for model interpretability of any machine learning model. SHAP values have a few advantages: they rest on a solid game-theoretic foundation, they provide local explanations for individual predictions, and those local explanations can be aggregated into consistent global feature importances.
Below shows the contributors to resale price for an example of low, middle and high resale price flats.
import shap
shap.initjs()
explainer = shap.TreeExplainer(rfr.best_estimator_)
# force plots for a range of low, middle and high resale price flats
for i in [16, 5, 1, 100, 172, 10084]:
    shap_values = explainer.shap_values(X_test.iloc[[i]])
    display(shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[i]]))
print("Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4")
In this project, linear regression and random forest were used to look at the drivers of HDB resale prices. Linear regression is powerful because it allows one to interpret the model by looking at the coefficient of every feature. However, it assumes a linear relationship between the features and the outcome, which isn't always the case in real life, and it tends to suffer from bias due to its parametric nature. Conversely, non-parametric methods do not assume any functional form, and random forest is a powerful non-linear machine learning model that uses bootstrap aggregating (bagging) and ensembling. A single decision tree has high variance as it tends to overfit the data; by bagging and combining many trees, random forest reduces that variance.
Looking at the output of the models, linear regression showed that region, floor area, flat model, lease commencement date and distance to the nearest hawker centre are the top 5 drivers of HDB prices. Random forest gave a slightly different result: floor area, lease commencement date and hawker distance remain in the top 5, while distance from Dhoby Ghaut MRT and flat type have also come out on top. This happens because tree-based models tend to give lower importance to categorical variables (region and flat model) due to the way importance is computed.
Nevertheless, the size of the flat, lease date, and certain aspects of the location appear to be consistently the most important drivers of HDB resale prices.
Future work: remove lease commencement date from the linear regression model and rerun it, as it still has a high VIF, which could lead to poor estimation of the model's coefficients. Different subsets of features could also be removed to check whether VIF remains high, and the linear regression rerun to find the best feature set. This would show whether features that are important in the random forest (e.g. distance from Dhoby Ghaut) are important in the linear regression as well.